
PERF: further improve take (reindex/unstack) for ArrayManager #40300


Merged
merged 9 commits into pandas-dev:master from the am-perf-take-2 branch on Mar 15, 2021

Conversation

jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Mar 8, 2021

A few related changes to take_1d / ArrayManager.unstack:

  • Check in ArrayManager.unstack whether there will be missing values introduced or not, and if not, pass allow_fill=False to the take function, so this can take a more performant path (and doesn't need to check repeatedly for indexer == -1). And the same for ArrayManager.reindex_indexer.
  • In take_1d, if allow_fill=False, use numpy's take instead of our own cython implementation (numpy is a bit faster); see the sketch after this list.
  • Move the ensure_int64 check for the indexer out of the shared _take_preprocess_indexer_and_fill_value. This way, we can do this once in ArrayManager.unstack and avoid doing it repeatedly for the same indexer inside take_1d.
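
A rough sketch of the idea behind the first two points (simplified, not the actual pandas internals; take_1d_sketch and its promotion logic are illustrative assumptions):

    import numpy as np

    def take_1d_sketch(arr, indexer, fill_value=np.nan):
        # only go through the fill/mask machinery when -1 ("missing") is present
        if not (indexer == -1).any():
            # fast path (allow_fill=False): plain numpy take, no repeated mask checks
            return arr.take(indexer)
        # slow path: promote the dtype so the fill value fits, then fill the holes
        out = arr.astype(np.result_type(arr.dtype, type(fill_value))).take(indexer)
        out[indexer == -1] = fill_value
        return out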

Case from the frame_methods.py::Unstack.time_full_product ASV benchmark:

dtype = 'int'
m = 100
n = 1000

levels = np.arange(m)
index = pd.MultiIndex.from_product([levels] * 2)
columns = np.arange(n)
values = np.arange(m * m * n).reshape(m * m, n)

df = pd.DataFrame(values, index, columns)
df_am = df._as_manager("array")

with the combined changes, this gives:

In [2]: %timeit df_am.unstack()
1.09 s ± 68.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- master
276 ms ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- PR

@jorisvandenbossche jorisvandenbossche added the Performance, Reshaping and Algos labels on Mar 8, 2021
Comment on lines +485 to +500
# check for promotion based on types only (do this first because
# it's faster than computing a mask)
dtype, fill_value = maybe_promote(arr.dtype, fill_value)
if dtype != arr.dtype and (out is None or out.dtype != dtype):
    # check if promotion is actually required based on indexer
    mask = indexer == -1
    needs_masking = mask.any()
    mask_info = mask, needs_masking
    if needs_masking:
        if out is not None and out.dtype != dtype:
            raise TypeError("Incompatible type for fill_value")
    else:
        # if not, then depromote, set fill_value to dummy
        # (it won't be used but we don't want the cython code
        # to crash when trying to cast it to dtype)
        dtype, fill_value = arr.dtype, arr.dtype.type()
Member Author

Nothing changed in those lines of the code; it's only a change in indentation (which GitHub doesn't seem to detect nicely).

Only the first lines:

    if indexer is None:
        indexer = np.arange(arr.shape[axis], dtype=np.int64)
        dtype, fill_value = arr.dtype, arr.dtype.type()
    else:
        indexer = ensure_int64(indexer, copy=False)

have been moved out into _take_nd_ndarray
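
In simplified form (a sketch of that refactor, not the PR's exact code; ensure_int64 is approximated with np.asarray here):

    import numpy as np

    def _take_nd_ndarray_sketch(arr, indexer, axis, fill_value):
        # the indexer normalization now happens once in the nd wrapper ...
        if indexer is None:
            indexer = np.arange(arr.shape[axis], dtype=np.int64)
            dtype, fill_value = arr.dtype, arr.dtype.type()
        else:
            indexer = np.asarray(indexer, dtype=np.int64)  # stand-in for ensure_int64
            dtype = arr.dtype
        # ... so take_1d, which already receives a normalized int64 indexer,
        # no longer repeats that work before the shared preprocessing step
        return indexer, dtype, fill_value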

Comment on lines +85 to +89
if indexer is None:
    indexer = np.arange(arr.shape[axis], dtype=np.int64)
    dtype, fill_value = arr.dtype, arr.dtype.type()
else:
    indexer = ensure_int64(indexer, copy=False)
Member Author

This part is copied from _take_preprocess_indexer_and_fill_value, so we can avoid doing that in take_1d

@@ -999,11 +1000,16 @@ def _reindex_indexer(

else:
    validate_indices(indexer, len(self._axes[0]))
    indexer = ensure_int64(indexer)
Contributor

why are you converting to int64? we use intp for indexing via take

Member Author

That's existing code (just cut and paste). And our cython take algos always use int64:

const int64_t[:] indexer,

Member

should they though? i expect they ought to be intp_t

Member

I'm increasingly confident that intp_t is the correct thing to do, but it's out of scope for this PR, xref #40390

if (indexer == -1).any():
    allow_fill = True
else:
    allow_fill = False
Member

@jorisvandenbossche in looking at the benchmarks you posted in #39146 it seems like some of them involve optimizations implemented for AM that could be implemented for BM but weren't. is there anything here that falls into that category?

Member Author

in looking at the benchmarks you posted in #39146 it seems like some of them involve optimizations implemented for AM that could be implemented for BM but weren't

Are you thinking of something specific?

The optimization I do here above can probably be useful if you have lots of non-consolidated blocks, but for the benchmark cases with single blocks posted in #39146, that wouldn't make a difference.

Member

Series has an algos.take_nd call that I think could be take_1d. Also several of the EA.take implementations.

@jorisvandenbossche
Member Author

Are there further comments here?

  indexer, dtype, fill_value, mask_info = _take_preprocess_indexer_and_fill_value(
-     arr, indexer, 0, None, fill_value, allow_fill
+     arr, indexer, None, fill_value, allow_fill
Member

do we ever get to _take_preprocess_indexer_and_fill_value with allow_fill=False?

Member Author

Yes, that's currently possible in take_nd, and a quick search reveals that we have a few cases throughout pandas that call take_nd specifying this argument. That's probably something to optimize, though.
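
For reference, the public pandas.api.extensions.take exposes the same allow_fill switch (illustration only, not the internal code path touched by this PR):

    import numpy as np
    from pandas.api.extensions import take

    arr = np.array([10, 20, 30])
    # allow_fill=False: -1 is an ordinary (wrap-around) index, plain numpy semantics
    print(take(arr, [0, -1], allow_fill=False))   # [10 30]
    # allow_fill=True: -1 marks a missing value, so the result is upcast and filled
    print(take(arr, [0, -1], allow_fill=True))    # [10. nan]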

@jreback
Contributor

jreback commented Mar 10, 2021

yeah, I think generally we should be using intp for indexing, as that's what numpy does

we should change our cython routines to do this (this may be a non-trivial ask)

it is likely this is a perf bottleneck (as we are moving between intp and int64 a fair amount)

I could be wrong though and this is not really an issue

@jreback
Contributor

jreback commented Mar 11, 2021

checks are failing (maybe just a rebase is needed)

@jbrockmendel
Member

it is likely this is a perf bottleneck (as we are moving between intp and int64 a fair amount)

aren't these aliases on 64-bit builds?
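
A quick check (illustrative; on 32-bit builds np.intp is only 32 bits, so they differ there):

    import numpy as np

    # on typical 64-bit builds both have an 8-byte item size and compare equal,
    # so intp <-> int64 conversion is cheap, though not entirely free
    print(np.dtype(np.intp).itemsize, np.dtype(np.int64).itemsize)
    print(np.dtype(np.intp) == np.dtype(np.int64))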

@jbrockmendel
Member

Can/should we just have take_nd do an earlier check for ndim == 1 and then dispatch to take_1d?

In NDArrayBackedExtensionArray.take, if I change take_nd to

        if self.ndim == 1 and axis == 0 and allow_fill:
            indices = ensure_platform_int(indices)
            validate_indices(indices, len(self))
            new_data = take_1d(
                self._ndarray, indices, allow_fill=allow_fill, fill_value=fill_value
            )
        else:
            new_data = [what we do right now]

then on the time_full_product benchmark referenced in #39146 (comment) we shave off 25-30% relative to master (non-AM)

@jorisvandenbossche
Member Author

Can/should we just have take_nd do an earlier check for ndim == 1 and then dispatch to take_1d?

That might be possible (although the check for EA and np.asarray might get a bit in the way for that). But we would still have the separate take_1d implementation, so I personally prefer to keep using that for explicitness in those places where we know it's only about 1D arrays (like the places in ArrayManager where I am using it now).

@jbrockmendel
Member

so I personally prefer to keep using that for explicitness in those places where we know it's only about 1D arrays (like the places in ArrayManager where I am using it now).

I'm fine with that, just suggesting that if we have a clear optimization for the general take, let's use it.

@jorisvandenbossche
Member Author

I'm fine with that, just suggesting that if we have a clear optimization for the general take, let's use it.

Certainly, see #40405

@jorisvandenbossche
Member Author

For the rest, any remaining comments before merging this?

@jbrockmendel
Member

I'll take another look this afternoon.

new_indexer2D = new_indexer.reshape(*unstacker.full_shape)
new_indexer2D = ensure_int64(new_indexer2D)
Member

can this chunk of code be made to match Block._unstack (or vice-versa) more closely?

e.g. on L1094 you check unstacker.mask.all(), which I think is related to the comment in Block._unstack (# TODO: in all tests we have mask.all(); can we rely on that?), but because mask is defined there differently it is not obvious

Member Author

Assuming you are referring to the EA-specific Block._unstack (as the other one has even less in common): that's for a single column while this is for multiple columns at once, and needs to deal with creating blocks, with placement, etc. So personally I don't see many possibilities to share code.

e.g. on L1094 you check unstacker.mask.all() which I think is related to the comment in Block._unstack

As far as I understand the Block version, I don't think that's related. For Block/BlockManager.unstack, that mask value somehow indicates which of the columns need to end up in the result (I don't really understand that code, though, as in the ArrayManager I am simply using the result of unstacker.get_new_columns without any filtering afterwards). Here I only check the mask to potentially use a faster take for performance reasons, but that doesn't influence the end result.
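
A toy illustration of that distinction (not code from the PR): when the MultiIndex is a full product, unstacking introduces no holes, so the take indexer contains no -1 entries and the allow_fill=False fast path applies; dropping a row creates a hole and forces the fill path.

    import numpy as np
    import pandas as pd

    full = pd.Series(
        np.arange(4), index=pd.MultiIndex.from_product([[0, 1], ["a", "b"]])
    )
    print(full.unstack().isna().any().any())    # False: full product, no NaNs introduced

    sparse = full.iloc[:-1]                     # drop the (1, "b") entry
    print(sparse.unstack().isna().any().any())  # True: the missing combination is filled with NaN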

Member

I don't think that's related

OK.

(I don't really understand that code, though

at some point we should make sure that one of us (active maintainers, not just the two of us) has a handle on this.

@jreback jreback added this to the 1.3 milestone Mar 12, 2021
@jreback
Contributor

jreback commented Mar 12, 2021

lgtm @jbrockmendel merge when ready

@jbrockmendel
Member

LGTM. Merging this evening absent other comments.

@jreback jreback merged commit 1ced878 into pandas-dev:master Mar 15, 2021
@jreback
Contributor

jreback commented Mar 15, 2021

thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the am-perf-take-2 branch March 15, 2021 07:24